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Abstract 



^J I The G-Wishart distribution is the conjugate prior for precision matrices that encode 



the conditional independencies of a Gaussian graphical model. While the distribution 



C^ . has received considerable attention, posterior inference has proven computationally 

challenging, in part due to the lack of a direct sampler. In this note, we rectify this sit- 
uation. The existence of a direct sampler offers a host of new posibilities for the use of 
G-Wishart variates. We discuss one such development by outlining a new transdimen- 



*^ ! sional model search algorithm-which we term double reversible jump-that leverages 

this sampler to avoid normalizing constant calculation when comparing graphical mod- 
^T . els. We conclude with two short studies meant to investigate our algorithm's validity. 
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1 Introduction 



Jones et al. 



The G aussian graphical model (GGM) has received widespread consideration (see 
20051 ) and estimat ors obeyin g graph ical constraints in standard Gaussian sampling were pro- 
posed as early as iDempsterl (1l972l ). Initial in corporation of GGMs in Bay esian estimation 
has largely focused on decomposable graphs ( jDawid and Lauritzenl. Il993l). since prior dis- 
tributions factorize into products of Wishart distributions. iRoveratd ( 120021 ) generalizes the 
Hyper-Inverse Wishart distribution to arbitrary graphs and, by consequence, sp ecifies a 
conjugate prior for sparse precision matrices K. lAtay-Kayis and Massanj (120051 ) further 



develop this prior and outline a Monte Carlo (MC ) meth od t hat enables the computa - 
tion of Bayes factor s. Fol 



owmg 



Letac and MassamI (120071 ) and 



Rajaratnam et al.l ( l2008l ). 



Lenkoski and Dobral (1201 ll ) term this distribution the G- Wishart, and propose computa- 
tional improvements to direct model comparison and model search. 

The desire to embed the G- Wishart distribution in more complicated hierarchical frameworks- 
particularly those involv i ng latent Gaussianity-exposed difficulties with the M C approxi- 



mation (see 



Dobra et al 



2011 : 



Wang and Li 



2012 



Cheng and Lenkoskil. 



2012, for discus- 



sion). These difficulties were partly related to numerical instability ( IWang and Lil . |2012| ). 
but were also methodological, as a realization of K was needed from the current model in 
order to update other hierarchical parameters (jCheng and Lenkoskil . 120121 ). At the t ime a 



host of Markov chain Monte Carlo (^MCMC) meth o ds had been propo sed (jPiccioni 



Mitsakakis et al. 



2011 



cept /reject sampler (IWang and Carvalho 



2000 



iDobra and Lenkoski. 



2011 



Dobra et al. 



2010|), which 



20 



Dobra et al. 



1) as well as an ac- 



(l201l[ ) shows suffers 



from very low acceptance probabilities even in moderate dimensional problems. Despite 
these developments no way of reliably sampling directly from a G- Wishart distribution has 
been proposed. 



We recti 



of 



y thi s situation. Our direct sampler is quite similar to the block Gibbs sampler 



Piccionil (120001 ) and involves sampling a standard Wi shart varia t e from a full model and 



using the iterative proportional scaling (IPS) algorithm ( iDempsterl . 1 19721 ) to then place this 
variate in the corre c t spa ce. Our approach differs critically, however, from the block Gibbs 
sampler of lPiccionil ( l2000l ) in that sampling occurs first, and independently of previous sam- 



ples, with the subsequent application of the IPS algorithm relative to a fixed target. 

The existence of a direct sampler considerably expands the usefulness of the G- Wishart 
distribution. We provide one example of this, by proposing a new method of moving through 



the sp ace o f GGMs. T h e reve rsible jump algorithms developed in iDobra and Lenkoski 



(J2OIII ). and iDobra et al.l ( 1201 ll ) provided a me ans of model averaging K in the context 
of more involved Bayesian models. As noted by IWang and Lil (J2012|), these approaches still 
require the use of unstable MC approximation of prior normalizing constants. With a direct 
sampler, we are now able to resolve this issue by proposing a n ew transdimensional algo- 
rithm that combines the concept behi nd the exchange algorithm (JMurray et al.l . |2006[ ) with 
reversible jump MCMC (Green], 119951 ). which we call double reversible jump. 



The article is organized as follows. In Section [2] we review the G-Wishart distribution, 
and propose the direct sampler. Section [3] develops the new double reversible jump algo- 
rithm. In Section H] we provide two short examples meant to confirm the validity of our new 
approach. We conclude in Section |5l 

2 The G-Wishart Distribution 



2.1 Basic Properties 

Suppose that we collect data V = {Z^^\ ..., Z^")} such that Z^^^ r^ Mp{0,K-^) indepen- 
dently for j G {1, . . . ,n}, where K G Pp, the space oi p x p symmeteric positive definite 
matrices. This sample has likelihood 

pr{V\K) = (27r)-"^/2|Kr/2exp (^-^{K,U)^ , 

where (^4, B) = tr{A'B) denotes the trace inner product and U = X]r=i Z^ Z ■ 

Further suppose that G = (V,E) is a conditional in dependence graph where V = 
{1, . . . ,p} and E G V xV. As in lCheng and Lenkoski! (J2012l ). we will slightly abuse notation 
throughout, by writing {i,j) G G to indicate that the edge {i,j) is in the edge set E. Asso- 
ciated with G is a subspace P^ C Pp such that K G Fg iraplies that K G Pp and Kg = 



when ever {i,j) G. The G-Wishart distribution (JRoveratd . 120021 : lAtay-Kayis and Massaml . 
20051 ) Wg(5, D) assigns probability to K G Pg as 



pr{K\6,D,G) 



Ig{S,D) 



|XK''-2)/2 



--{K,D)]lKeFa- 



This distribution is conjugate ( jRoveratd . 120021 ) and thus 



pr{K\6, D, G, V) = WaiS + n,D + U). 



Let C = {Ci, . . . , Cj} be a clique decomposition of the graph G. For our purposes we 
assume that this decomposition is maximally complete. We thus have that 

for each j G {1, . . . , J} and K eFq- We define the function Bc^^-) by 



Then given K ~ Wg(5, D) and any clique Cj of the graph G, iRoveratd (120021 ) proves that 



Kc,\K \ Kc, ~ W{6, Dc,,Bc^{K \ KcJ), (1) 

where, in general we write K ~ VV((5, D, B) to denote any matrix for which K — B r^ 
W((5, D). Equation ([1]) thereby gives the conditional distributions for an overlapping pariti- 
tion of E and proves critical to the developments below. 

2.2 Iterative Proportional Scaling and Block Gibbs Sampling 

As above, let Cj be one of the cliques of G. For A G P|c j define the transformation 

Tc,A : Pg ^ Pg (2) 



where 



while 



[Tc,,a{K)]c, =A + Bc^ {K \ Kc, 
[Tc,AK)]ik = Ki, 



if either / or k are not in Cj . 



Lenkoski and Dobral (J201ll ) use ([2]) to determine 



k^ = argmax [Kl^^-^^/^exp (--{K,D)]l 



K&G 



via an algorithm known as Iterative Proportional Scaling (IPS), following the work of 
Dempsterl ( 19721 ). The IPS algorithm works by constructing a chain K^^\K^'^\... such 
that K^^' = Ip and K^"^' is determined from K^^^'^' through the update 



K'-'^ = T, 



Cj,D, 



-1 o . . . o 



Tc.^n-AK^'-'^) 



eventually K^'^' coverges to K , see iLauritzenl (119961 ) for an in-depth discussion of the prop- 
erties of the IPS algorithm. 

le IPS a l gorith m takes deterministic updates and therefore converges to a unique ma- 



T 

trix. 



Piccionil ( l2000l ) e xtends the IPS idea to create an MCMC sampler for }Vg{S,D). The 



block Gibbs sampler of lPiccionil ( 20001 ) works by starting with a K^^' G Pg and constructing 
a chain K^^' , K^'^' , . . . via the update 

where Kj is sampled from a VV(5, Dc )■ We thus see that each subblock Cj is being sampled 
from its full conditional according to ([T]), satisfying the requirements of a Gibbs sampler. 



2.3 A Direct Sampler for G-Wishart Variates 



We borrow ideas from Section [2.21 to specify a direct sampler for Wci^, D). First sam- 
ple K* ~ yV{5,D) and determine E = {K*)~^. Set K^^' = Ip and construct a chain 
K^'^\K^^\ . . . where K^'^ is updated from K^'-^^ via 



K(') = T, 



Cj,S, 



.,,^--■■■-Tc,,^-,liK^'-'^ (3) 

Eventually K^'"^' will conv erge to a r natrix K G Pg- We note that the key difference between 
our algorithm and that of lPiccionil (120001 ) is the point in which random sampling occurs. In 
the block Gibbs sampler, new matrices are sampled in each step of the IPS update according 
to the appropriate conditional distribution. In our framework, sampling occurs first, relative 
to the full model, independently of all previous samples, and the IPS is then run with a fixed 
target. 

The question remains what properties K has inherited from K* . Note that by the nature 
of these updates, we have that 



Kc, - Bc,{K \ Kc,) = K* - Bc,{K* \ K 



c, 



Sc' 



for j G {1, . . . , J}. This fact is critical. By properties of standard Wishart variates, we know 
that 

since this matrix has not changed, we similarly have that 



Kc, - Bc^{K \ Kc,) ~ W{5, Dc, 
5 



or, equivalently 



KcAK\Kc^r^W{S,Dc,,Bcm\Kc,)) 



for all J G {1, . . . , J}. We note that several properties of K* are not shared by K. For 
instance let F = Ci U C2 C y. We have that 



K*p - Bf{K* \ K*p) ~ W((5, Dp), 



while this does not hold for K since Kik = for any / G Ci and A; G C2 \ Ci. Thus, while 
the conditional distributions of K* are not fully transferred by ([2]), those that are relevant 
for Wg{S, D) variates are retained. 



The fact that K ~ VVg('^, D) then follows from iBrookl (jl964l ). In part, we have specified 
a sampler that has postive density over Pg and the conditional distributions along a complete 
partition of the parameter set E correspond to those of a Wg(5, D). 



2.4 Improving the Performance of the Direct Sampler 

The sampler discussed in Section 12.31 relied on the IPS algorithm to move from K* G Pp to 
K G Pg- While the IPS is useful in illuminating the properties of i^ G Pg, it is compu- 
tationally burdensome. This is for two reasons, the first of which is the requirement that 
the clique decomposition C be both determined and stored, an NP hard problem. Further, 
the matrix Bq {K\ Kq) must be determined at each step for all j, an action that requires 
yed. If the cliques of G are small, this matrix will be nearly p x p. 
(120091 ) discuss an alternative algorithm to the one described in Sect ion 12. 2[ 
which can be modified to determine K G Pg from K* G Pp (JMoghaddam et al.l . |2009| . discuss 

^ G 

its use in determining K ). It works in the following manner 

1. Set VF = S. 



K 



nOi 



to be so^ 



Hastie et al. 



2. For J = 1,..., J 

a. Let Nj C F be the set of neighbors of node j in G. Form Wn^ and Sat^.j and 
solve 

b. Form /3j G MP~^ by copying the elements of (3* to the appropriate locations and 
putting zeroes in those locations not connected to j in G. 



c. Replace Wj,^j and W^jj with W_j^^jf3j. 

3. Repeat step 2 until convergence 

4. Return K = W~^. 



3 Double Reversible Jump 



The direct sampler discussed in Section 12.31 opens the possibility for a host of new appli- 
cations of the G-Wishart distribution in hierarchical Bayesian modeling. We focus on the 
problem of constructing a computationally efficient algorithm for mixing over the poste- 
rior pr{K,G\V), thereby forming a m odel averaged estima t e of K . We build upon the 
reversible ju r np alg orithms developed in iDobra and Lenkoskil ( 1201 ll ) and futher extended in 



DobraetaL 



(1201 ih . 



3.1 Reversible Jump and Related Algorithms 

Let G be given and suppose that K ~ Wci^ + n, D + U). Let $ be the upper triangular 
matrix such that #'$ = K, its Cholesky decomposition. The transformation from K to ^ 

has Jacobian 

p 

i- 

where I'f' = \{j : {i,j) G G,i < j}\ ( iRoveratd . 
K G Pg since its primary restriction is that 



j=i 



20021). Working with $ is useful when 



$ 



«j 



-^^$.$., 



for any {i,j) ^ G. Otherwise $jj G 
completion of $ as the action of using 
of G are specified. 



"■ 1=1 
while $, 



(4) 



G M for {i,j) G G. We refer to the 
to augment a matrix for which only the elements 



Dobra and Lenkoskil ( 120111 ) use this representation to move between neighboring graphs 



in the context of a larger MCMC Suppose that [K, G) is the current state of an MCMC 
chain, where K G Pc and we would like to attempt moving to G, which we assume to 
be equ al to G except for the additional edge {l.rn). The algorithm of 



Dobra and Lenkoski 



( I2OIII ) first determines $ from G, samples 7 ~ Af{^ij, a^ and forms $ where $ij = $jj for 



i = j or {i,j) G G, while $;m = 7- 4> is then completed according to G. This proposal is 
then accepted with probability niinja, 1} where 



a 



exp 



-^{k-K,D + U) 



<^uV2nag I^{6, D) 



exp( 



(5) 



2^2 



Subsequent to this move, the matrix K has typically been updated according to the accepted 
graph using MCMC methods, for instance the block Gibbs sampler . 



Several embellishments of the algorithm of iDobra and Lenkoskil (120111 ) have been devel- 
oped, including asymmetric model moves in the graph space and permuting the elements 
of K to increase acceptance (JDobra et al.l . 120111 ). noting that a conditional Bayes factor 
can be derived to ob viate the need for reversible jump when comparing neighboring graphs 
(jWang and Lil . |2012| ) and using notions of sp arse Cholesky decomposition s and node reorder- 
ings to reduce the time spent computing $ (jCheng and Lenkoskil . |2012| ). 

Each of these developments has proven to yield some improvement in performance in cer- 
tain situations. However, the most important technical problem with (|5]) revolves around the 
calculation of the normalizing co nstants Ip and Ir;. These fac tors require MC approximation 
( Atay-Kayis and Massaml . l2005l ). which IWang and Lil ( 20121 ) rather convincingly show fails 
in high dimensio ns. 



Wang and Lil (|2012|) propose an alter native approach, which borrows ideas from the 



exchan g e alg orithm (JMurray et al.l . |2006| ) and the double Metropolis-Hastings algorithm 
(JLiangl . |2010| ) to approximate this ratio. Unfortunately, the double Metropo lis-Hastings 
algorithm is not exact, though the approximation used by IWang and Lil (|2012[) appears to 
work well in practice for neighboring graphs. We note that the approach of IWang and Li 
(I2OI2I ) is not feasible if the graphs are not neighbors. 



3.2 The Double Reversible Jump Algorithm 



The exchange algorithm (JMurray et al.l . |2006| ) has proven a useful tool for general MCMC 
when working with models where the likelihood has an intractable normalizing constant. 
Wang and Lil (120121 ) discuss how to use the concept behind the exchange algorithm to aid 
model comparison, where prior distributions-like those of the G-Wishart-have a similar un- 
known normali zing constant that v aries according to the model. Unfortuntately, without a 
direct sampler IWang and Lil (120121 ) relied on the block Gib bs sampler to propose a version 
of the double Metropolis-Hastings algorithm (JLiangl . I2OIOI ). This approach should only be 



considered approximate, whereas the original exchange algorithm avoids normalizing con- 
stant calculations and still yields correct MCMC transition probabilities. 

With the existence of a direct sampler for Wg variates, however, we may use a modifica- 
tion of the exchange algorithm to avoid the normalizing constants in ([5]). We call this new 
approach double reversible jump. 

Suppose that G is the graph in the current state of an MCMC procedure and propose a 
new graph G. At the moment, assume that G is a neighbor of G with the additional edge 
(/, m) G G. We discuss the relaxation of this assumption in Section [51 The double reversible 
jump algorithm then proceeds by 

1. Sample K ~ yVciS + n, D + U) and form $, its Cholesky decomposition. Let d = ^im- 

2. Sample K ~ Wq{6,D) and form $ . Let 



^ = -iE*°.4 



I 



rm 



3. Sample 7 ~ M{;&, a|) and set 7 = $°^ - 'd 

4. Form $ where $ij = $ij for (z, j) E G 01 i = j and set $zm = 7- Complete $ according 
to G and set K = # $ 

5. Form $° where $°j- = <l>°j- for (i,j) G G and $°^ = -d. Complete $° according to G 
and set K" = ($°)'$°. 

6. Accept the move from G to G with probability min{l, a} where 
« = / . ~n : ^^jf exp 



exp(-i(K°-ii:°,D) 



n ^ V 2a,2 



We see that the double reversible jump algorithm considers switching between 

(K,G,K°,G) 

to the alternative 

(K,G,K°,G) 

by performing two reversible jump moves, one that moves between {K, G) to {K, G) accord- 
ing to the posterior parameters 5 + n and D + U and the other between {K , G) to (K^, G) 



according to the prior parameters S, D. By doing so, the prior normahzing constants in 
dS]) cancel, r naking double r evers ible jump the transdimensional equivalent to the exchange 
algorithm of iMurray et al.l ( l2006l ). 



4 Examples 

4.1 Sampling from a fixed, low-dimensional model 

We begin with a simple sanity check to ensure that the direct sampler of Section fl^ returns 
identical results as the block Gibbs sampler when both are run for an exceedingly long time. 
We set p = A and G = C4 the four cycle where edges (1,4) and (2,3) are missing. We then 
consider sampling from Wc4,{S, D) where we set 5 = 103 and 



D 



which was randomly generated to resemble the posterior distribution after observing 100 
samples drawn from a A/4(0, I4). We then ran the block Gibbs sampler as well as the direct 
sampler for 10 million iterations each, with an additional one million iterations for the block 
Gibbs sampler as burn-in. The expection of K taken over the samples from the block Gibbs 
sampler was 



136.431 


-10.15 


8.027 


2.508 


-10.15 


93.417 


-2.122 


-16.162 


8.027 


-2.122 


116.652 


11.62 


2.508 


-16.162 


11.62 


120.203 



0.7788 


0.0827 


-0.0516 





0.0827 


1.1594 





0.1528 


-0.0516 





0.9122 


-0.0864 





0.1528 


-0.0864 


0.9025 


over samples from the proposed dire 


0.7788 


0.0826 


-0.0516 





0.0826 


1.1593 





0.1527 


-0.0516 





0.9122 


-0.0863 





0.1527 


-0.0863 


0.9024 



As shown above, the expectations of the two samplers appear identical. Further, we note 
that all other comparisons we could consider-for instance element-wise variance, quantiles of 



10 



determinants, medians-likewise returned identical results. Different fixed graphs and choices 
of 5 or Z? do not affect the results. 

Since 10 million samples of the block Gibbs sampler after a one million sample burn-in 
should be expected to characterize a Wci{6,D) distribution, this brief study appears to 
confirm that our proposed sample is indeed a direct sampler for G-Wishart variates. 



4.2 Fisher's Iris Data 



Both 



Roveratd (120021 ) and lAtay-Kayis and MassamI (120051 ) use Fisher's Iris Virginica dataset 



to confirm their approximations of Iq- These data consist of four measurements. Sepal 
Length (SL), Sepal Width (SW), Petal Length (PL) and Petal Width (PW), taken on 50 iris 
plants. We use these data to compare the double reversible jump algorithm to an exhaus- 
tive sc oring of all models using the Monte Carlo approximation in lAtay-Kayis and Massam 
J2OO5I ). We set 6 = 3, D = Ip and a| = I. 

For all 64 models i n the graph space, we dete r mine the model probability by running the 
MC approximation of lAtay-Kayis and MassamI (120051 ) for one-million iterations. Further, 
we run the double reversible jump algorithm for five-million iterations and discard the first 
100,000 iterations as burn-in. This takes approximately ten minutes on a 2.8 gHz desktop 
running Linux and should be recognized as an extremely long chain. Table [1] shows the 
pairwise edge probabilities returned from the two methods. As shown in the table, the 
estimated edge probabilities using the two approaches agree. We note that if the double 
reversible jump chain is run for less time, say 50,000 iterations (which takes approximately 6 
seconds), results are nearly, but not perfectly, identical. Model moves are accepted in 23.2% 
of attempts, alternative choices of cr^ appear to have marginal effect on this acceptance level. 
See Section \5\ for a discussion of ways to improve mixing through more involved schemes. 

These results indicate that the double reversible jump, coupled with the direct sampler, 
enables model a veraged estimates of K to be f ormed without needing the unstable MC 
approximation of lAtay-Kayis and MassamI ( 20051 ). 



5 Conclusions 

We have proposed a direct sampler for G-Wishart variates, which promises to dramatically 
improve the usefulness of this distribution. In this note we have focused on using this 
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sampler to develop a trandimensional MCMC algorithm that has no normalizing constant 
evaluations. While this is a promising first step, there are considerable additional avenues 
for development. 

While the direct sampler performs well, in our mind the entire process is still too slow. 
In high dimensions, the majority of computing time is spent moving from K* G Pp to 
K G Pg- While the development in section 12.41 is considerably faster (and dramatically 
more stable) than the use of the IPS algorithm, we feel that there must be potiential for 
further improvements. Conn ecting with the rapid development of procedures for forming 
glasso (JFriedman et al.l . |2008| ) estimators would be fruitful in improving the efficiency of the 
sampler in high dimensions, since this action can be phrased as a constrained optimization 
pr oblem. 



Rodriguez et al.l ( 120111 ) consider embedding the G-Wishart distribution inside Dirichlet 



processses and related structures from nonparameteric Bayesian methods. However, de- 
composable graphs were used, since a direct sampler was unavailable for nondecomposable 
models and is critical in the posterior sampling of nonparameteric models. It is now possible 
to consider the use of general graphical models in Bayesian nonparametric approaches. 

Our development of the double reversible jump algorithm was partially to show how the 
direct sampler could be used to avoid prior normalizing constant evaluations when com- 
paring models. A host of embellishments could be made. When co mparing neighboring 
gr aphs, for instance, condition al Bayes factors could be computed as in IWang and Lil (120 12[ ) 



or 



Cheng and Lenkoskil (J2012[ ). In our mind, a more promising avenue for development would 



be to construct a procedure for global moves in the graph space. In order to work properly, 
we feel that global moves must be coupled with better proposals in the double reversible 
jump scheme. Relating these proposals to some of the guidelines in iRue and Heidi (120051 ) 
could prove useful in this regard. The ability to make large, focused moves in the graph space 
will be critical to extending the G-Wishart distribution to truly high dimensional problems. 
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Table 1: Pairwise edge probabilities from Monte Carlo (lower triangle) and double reversible 
jump (upper triangle) in the iris dataset 



SL SW PL PW 

SL 1 0.821 1 0.405 

SW 0.821 1 0.501 0.987 

PL 1 0.501 1 0.532 

PW 0.406 0.987 0.532 1 
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